Triton SM90: AxisInfoAnalysis Dimension Collapse and C++ Out-of-Bounds Diagnosis
@Qubitium @colesbury I audited this segmentation fault. The crash inside create_module (the extension-module loader in importlib._bootstrap_external) points to a thread-safety violation: Triton's C-extension architecture assumes the GIL, while Python 3.13t runs free-threaded (no GIL).
1. The Root Cause: Race Condition in Module Initialization
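In a standard (GIL-enabled) build, the GIL implicitly serializes the C code that runs while an extension module is loaded. Per-module import locks still exist under 3.13t, but they do not protect an extension's own global state, so that implicit protection is gone: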
- The Trigger: Your code triggers @triton.autotune and @triton.jit during module import. This activates compile_module_from_src, which attempts to dynamically create and link a C++ wrapper module for the generated CUDA kernel.
- The Failure: Triton's backend (specifically the driver and JIT compiler) heavily relies on global static state to manage the CUDA driver context and the LLVM PassManager.
- The Segfault: Without GIL serialization, concurrent import attempts or background autotuning threads lead to a race condition where create_module attempts to register symbols against a partially initialized or clobbered pointer, resulting in a memory access violation.
2. Lack of Multi-phase Initialization (PEP 489) Support
Triton's C extension has likely not yet adopted multi-phase initialization and has not been marked as supporting free-threading (by setting the Py_mod_gil slot to Py_MOD_GIL_NOT_USED). Consequently, Triton's un-isolated C++ state is exposed to the lock-free interpreter state, and both the interpreter's allocator (mimalloc in Python 3.13t) and the CUDA Driver API may be called concurrently from paths that were written on the assumption that only one thread runs at a time.
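To see which mode the interpreter is actually in, a small probe like the one below may help. It is a sketch that relies on sysconfig.get_config_var("Py_GIL_DISABLED") and sys._is_gil_enabled() (available in Python 3.13 and later); the helper name gil_status is made up for this example. CPython's free-threading documentation notes that importing an extension not marked as free-threading-safe can re-enable the GIL with a warning, so calling this before and after import triton is one way to check that assumption.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is on."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() exists on 3.13+; older versions always have a GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if probe is None else bool(probe())

    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
    }


if __name__ == "__main__":
    print(gil_status())
```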
Temporary Workaround
Until Triton refactors its C extension to be thread-isolated (removing the shared global state reachable from its driver layer, e.g. driver.py), you must force the interpreter to re-enable the GIL:
PYTHON_GIL=1 CUDA_VISIBLE_DEVICES=7 pytest test_mimo.py
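If the code runs without error under PYTHON_GIL=1, that is strong evidence that the segfault comes from un-isolated concurrency within Triton's C-API layer. A single clean run does not rule out an intermittent race, so repeat the test a few times before drawing conclusions.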
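For repeated runs, the same check can be scripted. The sketch below sets PYTHON_GIL in a child process's environment and reports whether the child exited normally or was killed by a signal. The helper names run_with_gil and describe are hypothetical, and the negative-return-code convention applies on POSIX only.

```python
import os
import subprocess
import sys


def run_with_gil(cmd, gil="1"):
    """Run cmd in a child process with PYTHON_GIL set to the given value."""
    env = dict(os.environ, PYTHON_GIL=gil)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


def describe(result):
    """Summarize a CompletedProcess; on POSIX a negative code means a signal."""
    if result.returncode < 0:
        return f"killed by signal {-result.returncode}"
    return f"exit code {result.returncode}"


if __name__ == "__main__":
    # Assumes pytest and test_mimo.py are available, as in the command above.
    outcome = run_with_gil([sys.executable, "-m", "pytest", "test_mimo.py"])
    print(describe(outcome))
```

On Linux a segmentation fault would typically show up as "killed by signal 11". Passing gil="0" only works on a free-threaded build; a regular build refuses to start with that setting.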
Verdict: This is an upstream compatibility gap in Triton’s backend rather than a bug in GPTQModel.